% this file is called up by thesis.tex
% content in this file will be fed into the main document
\ifpdf
\graphicspath{{4/figures/PNG/}{4/figures/PDF/}{4/figures/}}
\else
    \graphicspath{{4/figures/EPS/}{4/figures/}}
\fi


\chapter{From Micro to Macro: How emails contribute to network structure}\label{chap:Chp4} % top level followed by section, subsection

%\begin{quotes}
%We walk the corridors, searching the shelves and rearranging them, looking for lines of meaning amid leagues of cacophony and incoherence, reading the history of the past and the future, collecting our thoughts and collecting the thoughts of others, and every so often glimpsing mirrors, in which we may recognize creatures of information \\
%\attrib{Jorge Luis Borges \citeyearpar{borges1998}   }
%\end{quotes}


\begin{quotes}
In theory, theories exist. In practice, they do not.\\
\attrib{Bruno Latour \citeyearpar[p 178]{latour1988}   }
\end{quotes}


\begin{quotes}
HAMLET (Drawing his sword:) `How now! a rat? Dead, for a ducat, dead!' \\ (Stabs through the arras.) \\
\attrib{William Shakespeare, Hamlet: Act 3, Scene 4 }
\end{quotes}

% ----------------------- contents from here ------------------------

\section{Introduction}
This is the first of two empirical chapters: an exploratory, data-driven chapter whose objective is to test mechanisms that operate at the micro-level of transactions and have effects on the macro-level network structure. One common way to do this is to disaggregate the dataset of emails into groups (similar to the method described in section~\ref{sec:Chp3Snapshot}), each group consisting of emails of a certain type. The next stage is to compare the structural features of the network models built from each group. The rationale behind this method is that disaggregation reveals patterns that are concealed by the aggregate network. Just as the aggregate network conceals the timestamps of the emails that contributed to the making of specific ties, it also conceals the number of recipients on these emails.

In the current chapter, the email dataset is disaggregated according to the number of email recipients.\footnote{In what follows I use the term \textit{broadcast emails} to denote emails with relatively numerous recipients and \textit{private emails} to denote those with few recipients.} The motivation for this is twofold. First, as shown in section~\ref{sec:Chp3AggAndFilt}, one of the standard ways used by network modellers to filter the data is to remove emails whose number of recipients is above a certain threshold (broadcast emails). The typical justification cited is that broadcast emails are likely to represent bulk or spam messages rather than meaningful inter-personal ties, and are therefore not relevant for a network model that should represent meaningful social ties. However, it is unclear from the literature what consequences this type of filtering may have for the network structure \citep{zhou2005}. Second, the intuition is that, unlike private emails, broadcast emails are used for group discussions, coordination and collaboration. If this is so, we might expect to see different structures associated with private and broadcast emails.

In the course of the chapter, the analysis and comparison of the disaggregated network models reveal several unexpected patterns in the data. Specifically, different emails are found to contribute differentially to the clustering and reciprocity of the email network, with private emails contributing more to reciprocity and broadcast emails contributing more to transitivity. Possible mechanisms that link transactions and network structure are then discussed.

% mechanisms are discussed that link between transactions and network structure. In an ideal world, one would hope that explanatory links of this sort would take a neat form, exogenous features at one level explaining features at another level. For example, a given structure of organizational relations at the macro- and meso-level may thus explain the pattern of communication transactions \citep{gibson2005}\footnote{In his study of conversations within multi-participant meetings, \citet{gibson2005} demonstrates how exogenously given employment relations (at the macro- and meso- level) affect conversational turn-taking.}. Conversely, sequences of transactions taken as exogenously given, reaffirm relationships or establish new ones at the meso-level. 

%following properties
%\begin{itemize}
%\item Emails can be addressed to multiple recipients.
%\item Email recipients can be assigned to one of three different fields: the \textit{to} field, the \textit{cc} field and the \textit{bcc} field.
%\item Email client software allows recipients of incoming emails (stimulus) to engage in one of three responses: \textit{reply} to the sender, \textit{reply-all} sending a mail to the sender and all co-recipients and 
%\end{itemize}
%

%On a methodological level, the findings highlight gaps between communication datasets and the aggregate network model, paving the way to a central question facing network modellers, namely: given a dataset, what alternatives exist for the construction of network models? What would it mean to say that one method of constructing network models is `better' than another one?

%As shown in the previous chapter, there is already a growing literature dedicated to these problems. A naive approach would be to take each and every communication transaction between any two users $A \text{ and } B$, and to represent it as a connection between the two corresponding nodes in the network. But this problematic for two reasons; conceptually, an act of communication transaction is not the same as a social tie. At the very least a tie between two individuals is associated with multiple transactions (given that the two individuals can interact more than once,) making the link between ties and transactions a $1:N$ relationship. More controversially, since an email sent to multiple recipients can be classified as an action that spans more than one tie, such a transaction activates, at the very least, ties connecting the sender to each of the recipients and possibly also the ties among different recipients. This makes the link between ties and transactions a $N:M$ relation.  

%Practically, the naive approach yields high density network models, with features very different from those commonly found in the social network literature. The reason for this is that some instances of communication are either made by mistake (sending to a wrong address for example), or that they do not signify relevant ties. For that reason we find that network modellers often filter out the datasets, sieving out `signal' from `noise.' Four ways are typically used to limit the density of the network; 1) reducing the time frame for the dataset to include only acts of communication within that time-frame, 2) filtering out transactions that are not reciprocated, 3) setting a lower threshold for the minimum number of transactions of communication by excluding all links in which $A$ contacts $B$ less than a certain number of times, and 4) in the case of emails, excluding `spam' or `bulk' emails, those defined as messages that are sent to more than a certain number of recipients. 

This chapter focuses on the problem of email recipients. Focusing on this particular issue makes the discussion less general and more sensitive to the distinctive properties of email as a communication medium. I can think of no other communication medium that consists of a single author and multiple recipients, with the recipients themselves organized into three different categories. Those listed in the \textit{\textbf{`to'}} category are the immediate addressees of the message. Those who need to be aware of the existence of the message, but need not necessarily act upon it, are typically listed in the \textit{\textbf{`cc'}} category. Finally, the \textit{\textbf{`bcc'}} category lists recipients who remain invisible to the rest of the actors in the scene, like Polonius eavesdropping on an unfolding drama from behind the curtain. Thus, email is designed to allow for the organization of actors into an impressive array of roles, with senders signalling the kind of reaction they hope for from their recipients. Moreover, the email itself is embedded in a sequence of other transactions known as a `thread,' the context that gives the message its meaning, reminiscent of what Erving Goffman called \textit{the structure of a situation}: a set of constraints and affordances that guide the development of the sequence of communication transactions. The specific nature of the structure of a situation is a key factor that can determine the trajectory of social processes and their outcome at the macro-level.\footnote{Raymond \citet[][pp.~32--35]{boudon1986} poignantly demonstrates this in the context of diffusion of innovations, where the rate of diffusion is a macro-level feature of the system, its change over time hinging on what Boudon calls `the structure of the situation.'}

A second reason to pay close attention to the technology is that the interpretation of the data hinges, to a large extent, on the way people use the technology. In a study of mobile communication networks, for example, \citet{kovanen2010} had to split the users of mobile services in his dataset into two different groups, depending on the way they paid for the communication services: prepaid users, who pay for usage before making calls, and postpaid users, who pay afterwards. Different ways of paying for mobile usage are correlated with substantial differences in network statistics between the two groups of users, in terms of both degree distribution and reciprocity. To control for this effect, the groups were analysed separately.

Just as in the case of mobile communication data, the study of email data requires careful consideration of the way the technology is used. Consider, for example, the role of hubs (nodes from which emails are sent to an unusually large number of recipients) and authorities (nodes that receive emails from an unusually large number of senders). Like typical social networks with directed ties (\textit{arcs}), email communication networks have both hubs and authorities. But \citet{eckmann2004} rightly point out that, in contrast to the central roles of hubs and authorities in social networks, in the email graph one should handle them with some suspicion. Authorities might be service desks whose job consists of receiving an unusually large number of incoming emails, whereas hubs could be machines, mass mailers or administrators that distribute organizational announcements to many users. Thus, the importance of hubs and authorities hinges on the precise object of the inquiry. The workings of administration or service desks might be of less interest when studying certain types of collaborative operations in an organization, but they are crucial for understanding the diffusion of software viruses, for example.

There is a third reason for focusing on the idiosyncrasies of a specific technological medium, and that is to respond to a pressing concern within the community of information systems and organization researchers, a concern articulated in a seminal paper by \citet*{orlikowski2001}. In this paper, the authors argue that the study of information systems has long been preoccupied with a research program that treats technology as a black box, either by handling it in very general terms or by accepting it as a given, deterministic agent of change, ignoring the different modes in which it can be used, its unintended consequences and the context in which it operates. This program, they lament, has contributed precious little to theoretical advancement. To correct this sad state of affairs, the authors ask researchers to look into detailed features of technological artefacts and to be sensitive to unforeseen patterns of human-machine interaction. Focusing on the actual practice of designing, developing, using and interacting with the technological artefact itself, they assert, could be used as a `methodological device,' a point of departure through which researchers could gain greater theoretical insights into the operation of an organization as a whole.

Following these suggestions, this chapter pays special attention to the distinctive features and unique technical properties of email as they are reflected in the datasets, focusing on the consequences of these properties for the network modeller. To this end, the rest of this chapter is divided into three sections. The first section shows that emails are used for different purposes, with different types of emails giving rise to networks with different structural properties. This leads to the conclusion that the same group of individuals can be mapped into different organizational structures. The second section investigates the implications of these findings for the study of networks, exploring the issue of degree distribution in social networks. The third section develops two methods to incorporate the number of email recipients into the network model: first, by using a bipartite graph, and second, by using the strength of ties to reflect a weighted contribution of each email to the tie between email users.

The email communication dataset used in the current study consists of a snapshot taken from the famous Enron corpus \citep{shetty2004}. The chosen period spans the months of September to December 2001, as this was the most dramatic period for Enron as an organization and the most active in terms of the frequency of email exchange \citep{diesner2005}. Despite the concern that this period might represent an exceptional moment from a communication point of view, previous research gives us little reason to believe that this is likely to bias our results in any systematic manner. Moreover, in choosing this period the data could be validated against previous work that focused on the same period of this dataset \citep{Davis2007}.

Standard practices of data cleansing were employed to clean the raw data. For example, when it was established that an individual was using separate email accounts, the accounts in question were merged and represented as a single node in the network (as described for example in section $2.1$ of \citet*{chapanond2005}). All emails sent from or to users outside of the organization were discarded. In order to simplify the analysis, no distinction was made in this study between different recipient fields (\textit{to}, \textit{cc}, or \textit{bcc}). If a recipient was listed more than once in the list of recipients, the user was retained only once. If the sender of an email was also on the list of recipients, the user was removed from that list. Duplicate emails were identified and removed. The result of this arduous cleaning process was a dataset of $35,964$ emails. The constructed network consists of $9,818$ users interconnected by $68,409$ directed arcs.


\section{Anatomical and Functional Networks}\label{sec:FuncAndAnatomicalNetworks}

The distinction between anatomical and functional networks is adopted from the work on neural networks \citep{shalizi2006}, where subgroups of connected neurons synchronize their activity in a way that depends on the cognitive task at hand, so that every task excites a different sub-network of cells. Thus, the same anatomical network is associated with multiple functional ones. In a similar vein, studies of email communication networks \citep{eckmann2004} demonstrate how regions of the network are activated in synchronized activity, revealing sub-structures that are not apparent from the network of aggregated emails. The same individuals might occupy different positional roles in different sub-networks. For example, a central node in one sub-network could be marginal in another.

Various studies use this method of disaggregating the email dataset into separate groups, forming sub-networks from each group separately. Each study does this for different purposes. \citet{kovanen2010} analyses postpaid and prepaid mobile calls separately, finding that each subnetwork exhibits different structural patterns. \citet{eckmann2004} and \citet{braha2006} disaggregate the dataset into groups of emails sent within relatively short intervals from one another. Both papers report how the sub-structures and positions of nodes in the network vary between one sub-network and the other, the aggregate network deviating in its structure substantially from the structure of each sub-network. 

Continuing in  this tradition, the following empirical investigation disaggregates the email dataset into groups, each group consisting of emails with a certain range in terms of the number of recipients. If broadcast and private emails have different functions, the disaggregated network models represent different functional networks. Two functional networks are depicted in figure  \ref{ImgNetworkConstructedFromSingleAndMultipleRecipient}.
 

\begin{figure}[h]
\centering
\begin{subfigure} {.49\textwidth}
  \centering
  \includegraphics[width=.95\linewidth]{ImgNetworkSingleRecipient}
  \caption{\textbf{Network constructed from single-recipient emails.} One recipient per email. Connecting the nodes are $1471$ directed ties (\textit{arcs}), reciprocity is $43.9\%$, transitivity is $16.9\%$. \textit{Notice the open, fan-like structures in the network graph}. }
%  \label{ImgNetworkSingleRecipient}
\end{subfigure}%
~~
\begin{subfigure} {.49\textwidth}
  \centering
  \includegraphics[width=.95\linewidth]{ImgNetworkMultipleRecipient}
  \caption{\textbf{Network constructed from multiple-recipient emails}. $20$--$50$ recipients per email. Connecting the nodes are $1327$ directed ties (\textit{arcs}), reciprocity is $8.9\%$, transitivity is $31.3\%$. \textit{Notice the closed, triangle-like structures in the network graph}. }
%  \label{ImgNetworkMultipleRecipient}
\end{subfigure}
\centering
\caption[Two networks constructed from two types of emails]{\textbf{Two networks constructed from two types of emails.} All emails sent and received between members of a group of $254$ users within the same period of three months, each network constructed from emails with a different range of number of recipients}
\label{ImgNetworkConstructedFromSingleAndMultipleRecipient}
\end{figure}

%Without disaggregating the data, I cannot think of a way for the network model to reflect the difference between, say, sending ten single-recipient email messages, and sending a single email to all ten recipients. Both situations would yield a star formed network with the sender at its centre and the ten recipients connected to it with ties. But intuitively one should recognize that these two types of action are fundamentally different. In the first case, we are dealing with private messages and in the second case we are dealing with a broadcast message. Private and broadcast messages being two types of messages, each sent for different reasons and with different implications, differentially contributing to the structural features of the network as demonstrated in figure  \ref{ImgNetworkConstructedFromSingleAndMultipleRecipient}.

Both models in this figure are based on emails exchanged among $254$ individuals sampled from the Enron dataset (the sampling method is described below). The network on the left is based on single-recipient emails only. All $254$ users have either sent single-recipient emails to, or received them from, others in the group. The network on the right is based on multiple-recipient emails, with the number of recipients ranging between $20$ and $50$, and again all users have either sent emails of this kind to, or received them from, others in the network. All the emails were sent within the same time frame, the months of September to December of 2001.

The two network models consist of the exact same individuals, sending and receiving emails within the same period of time. Both networks have a similar density, one with $1471$ and the other $1327$ ties, yet the differences between the two networks are striking; compared to the broadcast email network, the level of reciprocity is much greater on the private email network, the level of transitivity much lower. 

The incoming and outgoing degree distributions of the two networks are shown in figure \ref{ImgInOutDegreeSingleAndMultipleRecipient}. Again we see a marked difference between the two networks. All four degree distributions are positively skewed, which means that although most of the degrees are within a relatively narrow range, there are some individuals with an anomalously large number of contacts, up to five times more than the typical numbers within this range. But despite this similarity, it is easy to identify differences between the networks in terms of the degree distribution.\footnote{Note that some individuals in the broadcast network on the right have low levels of out-degree, some less than the minimal number of recipients on emails. It may well be that some of the recipients are not in the network at all.} Two interesting features stand out when comparing the degree distributions. The first is the difference between the degree distributions of the private and broadcast emails, the former being much narrower than the latter. The second is the difference between the in-degree and the out-degree distributions of the broadcast emails, again the former being much narrower than the latter. What explains these differences?


\begin{figure}[h]
\centering
\begin{subfigure}[b]{.48\textwidth}
  \centering  \includegraphics[width=1\linewidth]{ImgInOutDegreeSingleRecipient}
  \caption{\textbf{Single recipient email network: }Degree distribution}
%  \label{ImgInOutDegreeSingleRecipient}
\end{subfigure}%
~
\begin{subfigure}[b]{.48\textwidth}
  \centering
  \includegraphics[width=1\linewidth]{ImgInOutDegreeMultipleRecipient}
  \caption{\textbf{Multiple recipient email network: }Degree distribution}
%  \label{ImgInOutDegreeMultipleRecipient}
\end{subfigure}
\caption[In and out-degree distribution of two email networks]{\textbf{Degree distribution of the networks appearing in figure~\ref{ImgNetworkConstructedFromSingleAndMultipleRecipient}}}
\label{ImgInOutDegreeSingleAndMultipleRecipient}
\end{figure}

 
\noindent Let us start with the difference in degree distribution between private and broadcast emails. This is especially curious given that the dataset includes an order of magnitude more private messages than broadcast messages (see table \ref{tab:SevenSubNetworks}). Granted, the number of recipients on each of the broadcast messages is an order of magnitude greater than the number of recipients on the private messages. Hence, these effects should more or less cancel out, since the number of sender-recipient pairs is of the same order of magnitude. The differences in network structure suggest that different social mechanisms are at work for private and broadcast emails. The decision to send email messages is governed by a set of interests and norms: private messages are sent to a smaller subset of contacts than broadcast messages, and there are many people with whom contact materializes \textit{only} in the context of broadcast messages. Like parties at which one wants to see and to be seen, broadcast emails seem to realize ties that do not materialize in private settings, whereas private messages realize relationships that are more intensive in terms of the frequency of transactions, designating perhaps stronger and more meaningful ties. This imputed link between the number of recipients on an email (at the micro-level) and the strength of a tie (at the meso-level) is an issue I shall return to below.

The second issue is the difference between the distributions of in- and out-degrees in the broadcast email network. Again, the sum of all in-degrees must equal the sum of all out-degrees, since every arc contributes exactly one to each, so the two distributions share the same mean and differ only in their spread. The wider out-degree distribution suggests that there are fewer senders of broadcast emails than receivers, so that only a subset of the group sends out the bulk of broadcast emails. In that case, that subset would have rather large out-degrees, whereas everyone's in-degree would be limited by the small number of such senders. The different degree distributions could thus reflect two roles in the organization: a small subset of users who tend to send messages to numerous others, and a larger group who mainly receive them.

A final test for the similarity between these two networks consists of a quadratic assignment procedure (QAP) \citep{krackhardt1987a}, carried out using the Ucinet6 software for Windows \citep{borgatti2002} with $2000$ permutations. It yields an estimated correlation of $0.23 \left( p < 0.001 \right)$, a magnitude that is rather modest compared with other QAP correlations found in the social network literature. Consider, for example, correlations between networks of self-reported relationships and networks reflecting observed interactions between people. When \citet{quintane2011} compare email communication networks to network data elicited through survey procedures, they find a QAP correlation of $0.35 \left(p < 0.01 \right)$. Other QAP correlations between observed interactions and self-reported survey data yield values in the range $0.29$ to $0.46$ \citep{quintane2011}. In this context, the correlation we find is rather low, albeit significant. This is not surprising given the descriptive summary statistics presented above, but QAP is a statistical inference method, and it establishes that the correlation is unlikely to be the consequence of stochastic effects in the data.

It is possible to continue the exploration and build Exponential Random Graph Models (\textit{ERGM}\g) as described in \citet{quintane2011}, but an attempt to estimate such a model using the XPNet software \citep{wang2006a} failed to converge, perhaps because of the relatively large network. The advantage of using ERGM would have been to point out which types of local structure (such as triangles, symmetric configurations etc.) are unique to each of the two networks, thus confirming through inferential statistics in what ways the observed networks deviate significantly from random networks. Moreover, ERGM could control for the differences in density when assessing the difference in reciprocity and transitivity. But it is very unlikely that network density is responsible for the different structures observed above, for two reasons. First, compared to the difference in reciprocity and transitivity, the difference in density between the two networks is small ($1471$ vs. $1327$ directed ties yields a difference of about $10\%$ in network density). Second, if density alone were responsible for the difference, we would expect the denser network to have both a higher level of reciprocity and a higher level of transitivity, which is not the case.
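
As a rough check on the magnitude quoted above, the two densities can be worked out explicitly. The calculation assumes the usual definition of density for a directed network without self-loops, namely the number of arcs divided by the $n(n-1)$ ordered pairs of nodes; this definition is an assumption made here for illustration and is not stated explicitly in the chapter. With $n = 254$ users,
\[
D_{\mathrm{private}} = \frac{1471}{254 \times 253} = \frac{1471}{64262} \approx 0.0229,
\qquad
D_{\mathrm{broadcast}} = \frac{1327}{64262} \approx 0.0206 .
\]
Since both networks share the same denominator, the relative difference in density equals the relative difference in the number of arcs: $(1471 - 1327)/1327 \approx 0.109$ when measured against the sparser network, or $(1471 - 1327)/1471 \approx 0.098$ when measured against the denser one. Either way the figure is close to the $10\%$ cited in the text.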



\subsection{A tale of seven networks}
To further generalize the findings about a link between email recipient number (at the micro-level) and network topologies (at the meso- and macro-level), more networks of the same group of users were constructed and compared. Before presenting the results, I would like to describe in greater detail how the group of $254$ users was chosen. The aim was to find a group of individuals, all of whom are connected to one another by emails with a wide range in the number of recipients.

The users were selected following an unusual and ad-hoc process. The problem was how to find a group of connected individuals, who contacted each other using emails with a large range of number of recipients. To reach this aim, the first step was to identify all users who sent or received emails with $30$ recipients or more. From the original $9,818$ users $503$ users were left. From this group, a subset of users was chosen such that all group members participate in single-recipient email transactions. Thus, each member in the new subgroup satisfies the following conditions: a. the member has been in contact with another member via a single-recipient email, and b. the same member has also been in contact with another via an email with 30 recipients or more. 

From this subset of users, an even smaller subset was chosen so that all members could be connected by emails of $2$--$3$ recipients. This procedure was repeated for different ranges of recipients, resulting in the above-mentioned group of $254$ individuals. Aggregating all emails sent and received within the specified time period produced an aggregate network with a density of $0.08$, a level of reciprocity of $0.41$ and a global clustering coefficient of $0.44$.

At this stage,  all emails sent and received within the group of $254$ were grouped according to the number of their recipients. The first group included all the single-recipient emails. This group had the largest number of emails. The second group included all emails with two or three recipients and so forth. A network was then constructed from each of the seven groups described in table \ref{tab:SevenSubNetworks}. 

\begin{table}[htb]
	\renewcommand{\arraystretch}{1.2}
	\centering
		\begin{tabular}{l|ll}
			\hline
			 \noalign{\smallskip}
			  	\parbox[t]{2.5cm}{\footnotesize \# of Recipients} & \parbox[t]{2.5cm}{\footnotesize \# of Emails} & \parbox[c]{4cm}{\footnotesize \raggedright \# of unique sender-recipient combinations } \\
			  \noalign{\smallskip} 
			  \hline 
			  \noalign{\smallskip}
				$1$		 & $6,140$ & $1,471$ \\
				$2$--$3$	 & $2,017$ & $1,292$ \\
				$4$--$7$	 & $956$   & $1,325$ \\
				$7$--$13$	 & $87$    & $1,266$ \\
				$13$--$21$	 & $353$   & $1,215$ \\
				$20$--$50$	 & $436$   & $1,327$ \\
				$40$--$285$ & $318$   & $1,278$ \\
			\hline
		\end{tabular}  
	\caption{Seven sub-networks of the same $254$ users, grouped by the number of recipients per email}
	\label{tab:SevenSubNetworks}
\end{table}
 
\noindent The range of the number of email recipients in each group was chosen so that the resulting networks would be comparable in terms of their density. For example, the first network is based on single-recipient emails sent between members of the group. $6,140$ such emails were exchanged, but they comprise only $1,471$ unique combinations of sender and recipient. Thus a directed, unweighted network was constructed with exactly $1,471$ directed ties connecting all $254$ members of the group. The second network was based only on emails with $2$--$3$ recipients. $2,017$ such emails were exchanged, constituting $1,292$ distinct ties connecting the same $254$ users. Seven networks were thus constructed on the same $254$ nodes, each containing a comparable number of ties (between $1,215$ and $1,471$).
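
The construction just described can be stated compactly. The notation in the following sketch is introduced purely for illustration and does not appear elsewhere in the thesis: $V$ is the set of $254$ users, $M$ the set of emails exchanged within the group, $s(m)$ the sender of email $m$, $R(m)$ its set of recipients, and $[a_k, b_k]$ the $k$-th recipient range of table~\ref{tab:SevenSubNetworks}. Assuming that an email is assigned to a range by its total number of recipients, and that only recipients inside the group give rise to arcs, the arc set of the $k$-th network would be
\[
E_k = \left\{ \bigl(s(m), r\bigr) \;:\; m \in M,\ r \in R(m) \cap V,\ a_k \le |R(m)| \le b_k \right\}.
\]
Because $E_k$ is a set, repeated emails between the same sender and recipient contribute a single arc, which is how $6,140$ single-recipient emails reduce to $|E_1| = 1,471$ arcs. Conversely, a single email with many recipients can contribute several arcs at once.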

The first and sixth of these networks were compared in the first part of this section. Let us now compare the first and the second network, expecting that these two will be much more similar. To make the comparison, another QAP was carried out between the network based on single-recipient emails and the one based on emails with $2 \text{ to } 3$ recipients. As before, the procedure was conducted using the Ucinet6 software for Windows \citep{borgatti2002} with $2000$ permutations, yielding an estimated correlation of $0.49 \left( p < 0.001 \right)$. This is a decent correlation, as far as QAP correlations of social networks go, and the similarity between the first and the second network is much greater than the similarity between the first and the sixth network, where the estimated QAP correlation was only $0.23 \left( p < 0.001 \right)$.

Three measures were calculated for each of the seven networks. Transitivity was measured via the global clustering coefficient, which is the proportion of closed \textit{triplets} to the total number of triplets in the network \citep{opsahl2009}:
\[
	C = \frac{\mbox{number of closed triplets}}{\mbox{total number of triplets}} = \frac{3 \times \mbox{number of triangles}}{\mbox{total number of triplets}}
\]

\noindent Here a triplet is a group of three connected nodes. For example, if $A$ is connected to $B$ and $C$, the three nodes are considered a triplet. A distinction is made between an open triplet (in which $B$ and $C$ are not connected) and a closed one (in which $B$ and $C$ are connected). Every triangle counts as three closed triplets, one centred on each of its nodes. Thus, a star with a centre and $n$ edges has $n\left(n-1 \right)/2 $ triplets, all of which are open, and its clustering coefficient is therefore zero. A complete graph with $n$ nodes has $n\left(n-1 \right) \left(n-2 \right)/2 $ triplets, all of which are closed, and hence a clustering coefficient of one.
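
\noindent To make the counting concrete, consider a small hypothetical graph (not taken from the data, and treating ties as undirected) with four nodes, in which $A$, $B$ and $C$ form a triangle and a fourth node $D$ is connected to $A$ only. Counting triplets by their central node:
\begin{align*}
	\mbox{centred on } A &: \; \{B,C\},\ \{B,D\},\ \{C,D\} \quad \mbox{(one closed, two open)} \\
	\mbox{centred on } B &: \; \{A,C\} \quad \mbox{(closed)} \\
	\mbox{centred on } C &: \; \{A,B\} \quad \mbox{(closed)} \\
	\mbox{centred on } D &: \; \mbox{none}
\end{align*}
There are thus five triplets, three of which are closed (the three contributed by the single triangle), so that the clustering coefficient of this graph is $\left(3 \times 1\right)/5 = 0.6$.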

Reciprocity was measured in two ways. The first was straightforward: the proportion of symmetric ties to all ties. As an additional measure of reciprocity, each network was compared with its transpose, and the Pearson correlation coefficient between the two matrices was calculated using QAP as described above. The results are presented in figure \ref{ImgRecipQAP} and figure \ref{ImgRecipCluster}.
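
\noindent The two measures are closely related, which a short sketch can make explicit. The notation below is introduced for illustration and rests on an assumption: that the first measure is computed over directed ties, i.e. as the share of ties $i \rightarrow j$ for which the tie $j \rightarrow i$ also exists. Let $x_{ij} \in \{0,1\}$ be the entries of the adjacency matrix of a network with $n$ nodes and $m$ ties, and let $d = m/n(n-1)$ be its density. Then
\begin{align*}
	r &= \frac{\sum_{i \neq j} x_{ij}x_{ji}}{\sum_{i \neq j} x_{ij}} \\
	\rho &= \frac{\frac{1}{n(n-1)}\sum_{i \neq j} x_{ij}x_{ji} - d^2}{d(1-d)} = \frac{rd - d^2}{d(1-d)} = \frac{r - d}{1 - d}
\end{align*}
where $\rho$ is the Pearson correlation between the off-diagonal entries of the matrix and those of its transpose. For sparse networks such as the ones studied here (roughly $1,300$ ties out of $254 \times 253$ possible ones, so $d \approx 0.02$), $\rho$ is only slightly smaller than $r$, and the two measures would be expected to move together. If the first measure were instead computed over dyads, the relationship would remain monotone but would no longer take this simple form.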


\begin{figure}[h]
		\centering
		\includegraphics[width=.6\textwidth]{ImgRecipQAP}
		\caption[Comparing two reciprocity measures]{\textbf{Comparing two reciprocity measures} - Reciprocity and the QAP correlation of the network matrix with its transpose, as a function of the number of recipients in the emails connecting 254 individuals}
		\label{ImgRecipQAP}
\end{figure}


\noindent The first result therefore confirms and extends the finding established in the first part of this section, namely that emails with an increasing number of recipients contribute a decreasing proportion of reciprocated ties to the network. Both measures of reciprocity point to the same trend: the proportion of reciprocated ties decreases as the number of recipients increases.\footnote{The disaggregation could potentially lead to an underestimation of reciprocity. For example, consider a situation where two users exchange emails such that in one direction all emails are broadcast emails and in the other direction all emails are private ones. In such a situation, the tie between the two users would be reciprocal in the aggregate network but would not be reciprocated in either of the disaggregated networks. However, it is unlikely that this effect is substantial, as the level of reciprocation of the single-recipient email network is very close to that of the aggregate network. It follows that the number of ties whose reciprocation has been severed through disaggregation remains relatively low.} There are several possible explanations for this trend, which will be discussed in greater detail in the following chapter. For now, one might consider an explanation grounded in email etiquette: it might be considered more acceptable to ignore broadcast messages and not to reply to them, whereas it might be less acceptable to do so with a private email. In fact, not replying to private emails could be regarded as rude, whereas broadcast emails may be seen as a nuisance to be ignored, which is why they are sometimes referred to as spam. Thus, norms that guide behaviour at the micro-level could have implications at the aggregate level, connecting single transactions to their collective outcome in the communication network.

\figuremacroW{ImgRecipCluster}{Reciprocity and clustering measures}{Reciprocity and the clustering coefficient as a function of the number of recipients in the emails connecting 254 individuals}{.5}


\noindent The second finding is that emails with an increasing number of recipients contribute a proportion of closed triplets which increases at first and subsequently decreases. This could be explained if each email delineates a bounded group of recipients, and if relationships among recipients of the same email are more likely to occur than relationships between recipients of different emails. If this is the case, larger recipient lists have the potential to create greater proportions of closed triplets. At first, this potential is realized, explaining why bigger groups contribute increasing levels of triplet closure. Above a certain threshold, however, the recipients of an email sent to a great number of people are less likely to know each other, explaining the waning proportion of triplet closure.


In what follows, I test the hypothesis that if a focal node sends a single email to two recipients, they are more likely to be connected to each other directly than if they had received separate emails. In other words, being co-recipients of the same email is a stronger indication of a tie than being recipients of distinct emails from the same node. To test this proposition, all ties were divided into three types of what may be called \textit{co-citation} categories. Given a combination of two nodes $A$ and $B$, this \textit{dyad} is said to be co-cited if both nodes have had email contact with a third node. Accordingly, there are three categories. \textit{Strong co-citation} consists of dyads whose nodes are co-cited in a single email. This means that there exists at least one email sent from some third node $C$ in which both $A$ and $B$ are co-recipients. In contrast, \textit{weak co-citation} consists of dyads with co-cited nodes for which not a single email could be found in which $A$ and $B$ are both recipients. In other words, despite the existence of at least one node $C$ sending emails to or receiving them from $A$ and $B$, no email was sent from $C$ that addresses both $A$ and $B$ concurrently. The third category consists of dyads which were not co-cited at all. This means that there exists no node $C$ sending emails to or receiving them from both $A$ and $B$. All connected dyads in the seven networks were classified according to these three categories, and the results are presented in figure \ref{ImgCoCitationTies}.
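
\noindent The three categories can be stated more formally. The notation is introduced here for illustration only: let each email $e$ have a sender $s(e)$ and a set of recipients $R(e)$, and write $N(A)$ for the set of nodes that have sent an email to $A$ or received one from $A$. A dyad $\{A,B\}$ is then classified as
\[
	\mbox{type}(A,B) =
	\begin{cases}
		\mbox{strong} & \mbox{if some email $e$ has $s(e) \notin \{A,B\}$ and $\{A,B\} \subseteq R(e)$} \\
		\mbox{weak} & \mbox{if no such email exists, but $\left(N(A) \cap N(B)\right) \setminus \{A,B\} \neq \emptyset$} \\
		\mbox{none} & \mbox{otherwise}
	\end{cases}
\]
Under this reading, every strongly co-cited dyad also has a common contact, since the sender of the shared email belongs to both $N(A)$ and $N(B)$. The categories are made mutually exclusive by testing for the strong form first.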


\figuremacroWH{ImgCoCitationTies}{Classification of ties into categories of co-citations}{All ties of the seven networks of table \ref{tab:SevenSubNetworks} were classified into three types of co-citation}{.5}

In all seven networks, about $20\%$ of the connected dyads are not co-cited at all. The nodes associated with these ties share no common acquaintances. They could be bridges connecting disparate groups, or else one or both of the nodes may have no other connections. Moreover, by the very definition of co-citation, networks based on single-recipient emails can have dyads co-cited in a weak form only. This is simply because having co-citation of the stronger form would require there to be emails with more than one recipient. 

The interesting feature of figure \ref{ImgCoCitationTies} is that the more recipients in the emails, the greater the proportion of strong co-citations relative to weak ones. This suggests that as we move from more private to more broadcast emails, ties are more likely to be strongly co-cited. This could, of course, simply reflect the distribution of co-citations among the dyads themselves: perhaps we see more strong co-citations among ties because a greater percentage of all dyads are strongly co-cited. To investigate this point further, all co-cited dyads in all seven networks were classified into two forms of co-citation, strong and weak, as shown in figure \ref{ImgCoCitationDyads}.

\figuremacroWH{ImgCoCitationDyads}{Classification of all co-cited dyads into two categories of co-citations}{All co-cited dyads in the seven networks of table \ref{tab:SevenSubNetworks} were classified into strong and weak co-cited dyads}{.5}

This figure shows clearly that despite the increase in the number of email recipients, there remains a substantial proportion of weakly co-cited dyads: at least $20\%$, and in all but one network $40\%$ or more. Thus, it seems unlikely that the connected and co-cited dyads are merely a random sample of all co-cited dyads since, if they were, we would expect to see many more weakly co-cited ties. The conclusion is that an email sent to two or more people is a strong indication that these people are themselves connected. The causal direction is of course unknown: it is possible that an email was sent to these people because the sender knows that they are connected, and it is equally possible that the email itself is the medium that prompted the connection between them.

Be that as it may, the finding can be seen as an extension of the principle of transitivity known from social networks: yes, the friends of my friends are likely to be my friends. But this likelihood increases greatly if my friends interact with their friend \textit{and} with me at the same time, via an email sent concurrently to both of us. The key here is to highlight the importance of the structure of interactions, over and above the existence of networks of human relationships \citep{feld1981}. The tendency towards transitivity at the level of network topology is important, of course, but the theoretical principle of micro-foundations encourages us to seek the explanation for this tendency at the order of interactions.

\figuremacroWH{ImgInOutDegreeSevenNetworks}{Degree distribution of seven disaggregated networks}{Comparing the in- and out-degree distributions, notice how the distribution becomes wider moving from private to broadcast emails. Also note that the differences are more pronounced in the out-degree distribution, indicating, perhaps, a separation of roles within the organization}{.7}

\noindent The third finding, presented in figure \ref{ImgInOutDegreeSevenNetworks}, concerns the distribution of in- and out-degrees in the seven networks. Though the distribution of in-degrees is relatively consistent across the private-broadcast spectrum of networks, the disparity between the in-degree and out-degree distributions widens as we move from private email networks to broadcast ones. In private emails, both the in- and out-degree distributions remain relatively narrow (though positively skewed). In broadcast emails, by contrast, we observe a group of people who send out emails to a large number of recipients, while many users receive emails from a relatively small number of others and do not bother to reply in the form of broadcast emails. They receive the emails but do not participate in a discussion. The third finding chimes with the asymmetric nature of broadcast emails on the one hand, and on the other makes a case for a partition of roles: a distinction between those who participate in the dissemination of multi-recipient emails and those who do not.

%\figuremacroW{ImgRecipCluster}{Reciprocity and clustering measures}{Reciprocity and the clustering coefficient as a function of the number of recipients in the emails connecting 254 individuals}{.8}





% given two models, how do we know that one is better than the other? 
% this serves as a gap between traditional methods of data collection and digitally mediated forms - data collection techniques and stuff vs. networks...


%In the previous chapters I argued that the study of social networks today increasingly relies on the analysis of computer-generated streams of data, the very traces left by events which are interpersonal, ephemeral, directed, mediated and meaningful in a certain social context \citep{Monge2003}. But social contact is not the same missing from this data is an indication of a durable social bonds of obligation and commitment between individuals. This type of data was rarely available in the past, and its fleeting nature limited its theoretical appeal for those who study durable social networks. Consequently, the availability of of raw data of this kind brings about a new challenge, the need to reconcile the gap between the structure of the data and the structure of social networks models \citep{Butts2009}.
%Communication data is rarely generated for the exclusive purpose of network analysis, more often it is collected for auditing purposes of a more technical nature, and consequently, network researchers have little control over the design of the output data and what it denotes. To analyze the data from a network perspective, one must first transform them into network models that are amenable to conventional methods of network analysis. This is because network models consist of nodes representing stable entities and ties representing stable, dyadic relationships, whereas the data consist of multiple and related interactions cascading from one situation to another and connecting individuals, groups, themes and digital objects. This problem is related to what is known in the literature as the gap between interaction approaches and social network analysis \citep{Gibson2005}.
%Previous studies of email networks addressed this problem by creating network data models in two stages: construction and filtering. Construction proceeds by designating email users as nodes of the network. Any two nodes are connected by a directed tie if an email is sent from one node to the other. As a consequence, each email is represented by an ego-centric star configuration with the email's sender at the center of the star and its recipients at the other end of each edge. Star configurations are then aggregated to construct the final network. To understand what information is lost in this process, consider the difference between sending a single email to n recipients and sending n private emails, each to one recipient. Though there are reasons to suspect that people will react differently to `public' and to `private' emails, these two cases become indistinguishable in the network model because of the way it is constructed. However, retaining this distinction may be used to explain why some connections materialize while others do not.
%Constructing network models typically requires filtering out noise. Ties or emails are discarded if they are deemed irrelevant. This is often done choosing thresholds in an ad-hoc manner. Ties may be discarded if they are not symmetric, or if their throughput falls below a chosen threshold. Emails are typically discarded if they are suspected to be bulk emails because those too are not seen to represent `real interpersonal' relationships \citep*{Kossinets2006, Tyler2005}. The filtering stage raises further concerns of lost information when taking into account that recipient lists are rarely an arbitrary collection of individuals \citep*{Zhou2005}, and that even in bulk emails they may delineate meaningful organizational units.
%Taken together, the findings of this chapter question whether emails merely indicate underlying, already existing patterns of inter-relationships. Rather, they prompt us to view communication technology as a tool users employ to catalyze the formation of new ties or the fortification of old ones. The specific usage has an effect on the way recipients become aware of opportunities or judge the expectations of others, leading to communicative action that gives rise to certain structural patterns. Thus, the findings presented here can inform the study of the localized social processes that give rise to and sustain networks. These processes are precisely the ones that lie at the very heart of the study of social networks, processes such as homophily, popularity mechanisms and cognitive balance \citep*{Snijders2006}.
%The rest of this paper evaluates the potential utility of the information that is lost in the process of network construction and filtering, and proposes ways to introduce it back into the network model. In the following section, patterns of email usage are presented to motivate the idea that emails delineate meaningful groups of related individuals. The third section argues that structural properties of networks depend on the number of recipients of the emails which constituted them. Moreover, individuals are found to interact differently depending on whether they do so in more private or in more public settings. In the last section, an attempt is made to identify the social mechanisms that explain the formation of network structures, thus contributing to the long standing debate between social interaction approaches and social network analysis \citep*{Butts2008, Gibson2005, Mische1998}.

\section{Fat tail distributions in email communication networks}\label{sec:Chp4Distributions}

% Several dimensions to the artifact
% Sender receiver relationships- some media are one to one, some one-to-many, some many to many.
% Synchronicity
% Time modality - Asynchronous (Email, FB-Wall) to semi-Synchronous
% Level of privacy
% Network shape - Go whether you are
% Multiplexity - Visual, Oral, Texual communication.
% What I think when i I n
\subsection{Deriving the reverse...}
When \citet{newman2001a} studied the production of scientific articles, he reproduced and developed a pattern first published by \citet*{Lotka1926}, known today as Lotka's law. The law states that the number of publications per author is distributed as a power law, as expressed in equation \ref{FrmPower}. This distribution determines the probability that a given author would publish any given number of papers. An important characteristic of the distribution is that it is highly skewed, especially when $\alpha$ is small: it is a `fat tail' distribution in which `typical' values cover a very large range, usually several orders of magnitude. This sets it apart from many distributions found in nature and in the social sciences, those known as `normal' distributions, most of whose values centre on a relatively limited range of `typical' numbers. Normally distributed variables include, for example, human weight, the rate of suicide in a community, disease etc. Human height, for example, typically ranges between 1.5 meters and 2 meters. A height of dozens of meters is simply unthinkable. In contrast, the number of publications penned by a single author can range from one to many dozens. Power-law distributed variables do not exhibit a typical value or range; their variance is often very large and their mean of little use or relevance.
\begin{equation}
p(x) \propto x^{-(\alpha+1)} %\quad \mbox{for $\alpha > 0$}	
\label{FrmPower}
\end{equation}
This is of course an approximation, since $x$ is a random variable representing the number of papers published by an author, the number of recipients of an email, the number of co-authors of a publication etc. All these variables are both discrete and bounded. Treating a discrete variable as if it were continuous is common practice and, in this case, does not present a mathematical difficulty. However, the boundedness of the variable is of consequence to the normalizing constant. Thus, without loss of generality, the normalizing constant $C$ in the following equation can be calculated.
	\[ p(x) 	= Cx^{-(\alpha+1)} \quad \mbox{for $x \in (a,b)$,} \quad \mbox{$\alpha > 0$}	\]
Here the normalizing factor $C$ is obtained by integrating the distribution over the range $x \in (a,b)$ and equating the result to one. The normalized distribution function is presented in equation \ref{FrmNormPower}.

\begin{equation}
p(x) = \frac{\alpha}{a^{-\alpha} - b^{-\alpha}}x^{-(\alpha+1)}
\label{FrmNormPower}
\end{equation}
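
\noindent For completeness, the intermediate steps behind equation \ref{FrmNormPower} can be sketched as follows, assuming $0 < a < b$ and $\alpha > 0$:
\begin{align*}
	\int_a^b{p(x)\;dx} &= \int_a^b{Cx^{-(\alpha+1)}\;dx} \\
		&= -C\frac{x^{-\alpha}}{\alpha}\Bigg|_a^b \\
		&= C\frac{a^{-\alpha} - b^{-\alpha}}{\alpha} = 1 \\
	\Rightarrow \quad C &= \frac{\alpha}{a^{-\alpha} - b^{-\alpha}}
\end{align*}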

%\\
%\int_a^b{p(x)\;dx} &= \int_a^b{Cx^{-(\alpha+1)}\;dx} \\
%									 &= -C\frac{x^{-\alpha}}{\alpha}\Bigg|_a^b \\
%									 &= C\frac{a^{-\alpha} - b^{-\alpha}}{\alpha} = 1 \\
%									 p(x) &= \frac{\alpha}{a^{-\alpha} - b^{-\alpha}}x^{-(\alpha+1)}
%\label{FrmPowerNormalizing}
%\end{align*}

\noindent One way to explore the power distribution is to simulate it using the inverse of the cumulative distribution function (CDF). This is perhaps the easiest method for sampling from a given distribution, described for example by \citet*[p 153]{Jackman2009}. Consider a sample needed from a general distribution $p(x)$, where $x \in (a, b)$. The CDF is defined such that $CDF(u)=\Pr(X \leq u)=\int_{a}^u{p(x)\;dx}$, $CDF$ being a function that maps $(a,b)$ onto the unit probability interval, $CDF:(a,b) \rightarrow (0,1)$. Using the following algorithm, it is relatively straightforward to sample from $p(x)$, provided that the inverse CDF exists such that $CDF^{-1}:(0,1) \rightarrow (a,b)$, and provided that it is a computable function.
\begin{program}
\mbox{Inverse CDF sampling algorithm:}
\BEGIN %
\FOR t:=1 \TO T \DO
 	   \texttt{sample}\ \  u^{(t)} \sim UNIFORM(0,1)
		 x^{(t)}\leftarrow CDF^{-1}(u^{(t)}) \OD
\END \FOR
\end{program}

\noindent We can now develop the inverse cumulative function $CDF^{-1}$ associated with equation \ref{FrmNormPower}. First, the $CDF$:
\[
	CDF(x) = \frac{\alpha}{a^{-\alpha} - b^{-\alpha}}\int_a^x{\hat{x} ^{-(\alpha+1)}}\;d\hat{x}
		   = \frac{a^{-\alpha} - x^{-\alpha} }{a^{-\alpha} - b^{-\alpha}}
\]

\noindent Reorganizing and replacing $CDF(x)$ by $u$, a uniformly distributed random variable ranging between $0$ and $1$, yields:
	\[
		x = \left[a^{-\alpha} - u(a^{-\alpha} - b^{-\alpha}) \right]^{-1/\alpha}
\]
\noindent Now all we need to do is to sample $u$ from the uniform distribution, yielding a power distribution on $x$. Notice that $x$ is a monotonically increasing function of $u$, approaching $a$ ($b$) as $u$ approaches $0$ ($1$). Plotting the power distribution $x^{-1.5}$ defined on the range $(1,100)$, figure \ref{ImgPowerDist} shows the exact distribution function (DF) as in equation \ref{FrmPower} and a histogram of a simulated sample of this distribution using the inverse CDF method described above. One problem which remains is how to fit this distribution given the data alone. To address this problem I shall now introduce the reverse CDF function.
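
\noindent The rearrangement can be spelled out step by step, followed by a numerical check for the parameters of the simulated example ($a=1$, $b=100$, $\alpha=.5$). Setting $u = CDF(x)$:
\begin{align*}
	u\left(a^{-\alpha} - b^{-\alpha}\right) &= a^{-\alpha} - x^{-\alpha} \\
	x^{-\alpha} &= a^{-\alpha} - u\left(a^{-\alpha} - b^{-\alpha}\right) \\
	x &= \left[a^{-\alpha} - u\left(a^{-\alpha} - b^{-\alpha}\right)\right]^{-1/\alpha}
\end{align*}
With $a^{-\alpha} = 1$ and $b^{-\alpha} = 100^{-1/2} = 0.1$ this reduces to $x = (1 - 0.9u)^{-2}$, so that
\[
	u = 0 \mapsto x = 1, \qquad u = 0.5 \mapsto x = 0.55^{-2} \approx 3.31, \qquad u = 1 \mapsto x = 0.1^{-2} = 100
\]
Half of the sampled values would thus be expected to fall below about $3.3$, even though the range extends to $100$, which illustrates how strongly the distribution is skewed.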

\figuremacroWH{ImgPowerDist}{Sampling from a power distribution}{A power distribution with exponent $1.5$, $(\alpha=.5)$ and valid values of $x$ ranging between $1$ and $100$}{.5}

\noindent The problem, of course, is that we have a distribution that seems to follow a power law, and the objective is to fit it and estimate $\alpha$. Several methods of fitting a power law are described in \citet{newman2005}, methods based on taking the $\log$ of both sides of equation \ref{FrmPower}, which gives $\log p(x) = -(\alpha+1)\log x + \mbox{const}$, a linear relationship with slope $-(\alpha+1)$ on the log-log scale. Consequently, fitting a power law to a distribution amounts to regressing the logarithm of the frequencies on the logarithm of the observed values, the estimated slope being $-(\alpha+1)$. A plot of the power distribution with exponent $1.5$ $(\alpha=.5)$ approaches a straight line in figure \ref{ImgPowerDistLogLog}. However, as the figure shows, there is a difficulty with this approach, namely that for large values of $x$ the probability of an observation is low. This results in a noisy curve on the tail of the distribution, a region where the power-law function dwindles and the number of observations is small.
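
\noindent The regression described above can be written out explicitly. This is a sketch in notation introduced here for illustration: $x_k$ denotes the $k$-th distinct observed value (or bin centre), $f_k$ its relative frequency, and $K$ the number of such values with $f_k > 0$. Writing $\ell_k = \log x_k$ and $y_k = \log f_k$, the ordinary least squares slope and the implied estimate of $\alpha$ are
\begin{align*}
	\hat{\beta} &= \frac{\sum_{k=1}^{K}(\ell_k - \bar{\ell})(y_k - \bar{y})}{\sum_{k=1}^{K}(\ell_k - \bar{\ell})^2} \\
	\hat{\alpha} &= -\hat{\beta} - 1
\end{align*}
where $\bar{\ell}$ and $\bar{y}$ are the means of the $\ell_k$ and the $y_k$. Values with $f_k = 0$ have no logarithm and must be dropped, which is one reason the sparsely populated tail is problematic for this approach.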

One method of dealing with the large fluctuations at the tail is to bin the observations, either by having all bins equal in size, or by increasing the size of the bins where observations become scarce. Increasing the range of the bins decreases their `resolution' but allows each bin to capture observations with greater probability. An alternative to binning is to use the reverse cumulative distribution function ($CDF_{reverse}$), where the \texttt{y-axis} denotes the probability of observing a value equal to or greater than the value on the \texttt{x-axis}. The reverse CDF is then expected to be:


\begin{align*}
	CDF_{reverse}(x) & = \int_x^b p(\hat{x}) \; d\hat{x}    \\
%									 & = \frac{x^{-\alpha} - b^{-\alpha} }{a^{-\alpha} - b^{-\alpha}}\\
			 & = \frac{a^{\alpha}}{b^{\alpha} - a^{\alpha}}\left[  (b/x)^{\alpha} - 1 \right]
\end{align*}
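\noindent The closed form above follows from $CDF_{reverse}(x) = 1 - CDF(x)$. A sketch of the intermediate steps, using only quantities already defined:
\begin{align*}
	CDF_{reverse}(x) &= 1 - \frac{a^{-\alpha} - x^{-\alpha}}{a^{-\alpha} - b^{-\alpha}}
		= \frac{x^{-\alpha} - b^{-\alpha}}{a^{-\alpha} - b^{-\alpha}} \\
		&= \frac{a^{\alpha}b^{\alpha}\left(x^{-\alpha} - b^{-\alpha}\right)}{a^{\alpha}b^{\alpha}\left(a^{-\alpha} - b^{-\alpha}\right)}
		= \frac{a^{\alpha}\left[(b/x)^{\alpha} - 1\right]}{b^{\alpha} - a^{\alpha}}
\end{align*}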
\noindent When $x$ is at its minimum permitted value, $CDF_{reverse}(x=a)=1$, since all observations are at or above that minimum value. By the same token, $CDF_{reverse}(x=b)=0$, since none of the observations is greater than the upper boundary of the distribution. We saw above that the $\log$ of the DF is exactly linear in $\log x$. We shall now see that the $\log$ of $CDF_{reverse}$ is approximately linear.

\begin{align*}
	\log(CDF_{reverse}(x)) & = \log\  \frac{a^{\alpha}}{b^{\alpha} - a^{\alpha}}  + \log\left[ (b/x)^{\alpha} - 1\right] \\
			& \approx  \log \frac{a^{\alpha}}{b^{\alpha} - a^{\alpha}} + \log(b/x)^{\alpha} \quad \quad \mbox{for $(b/x)^{\alpha} \gg 1$}	 \\
			& = \log  \frac{a^{\alpha}b^{\alpha}}{b^{\alpha} - a^{\alpha}}   - \alpha\log x			
\end{align*}

\noindent Note that the approximation works best when $b \gg x$ and when $\alpha$ is large compared to zero. When these conditions hold, the log-log graph of $CDF_{reverse}$ is approximately a straight line, just like the log-log graph of the $DF$. But unlike the $DF$, each point on the graph represents not only one, but several observations, minimizing the noise on the tail of the distribution.
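
\noindent The quality of the approximation can be quantified. Dropping the $-1$ replaces the bracket $(b/x)^{\alpha} - 1$ by $(b/x)^{\alpha}$, so the ratio of the exact to the approximate value is
\[
	\frac{(b/x)^{\alpha} - 1}{(b/x)^{\alpha}} = 1 - (x/b)^{\alpha}
\]
As an illustration, for the parameters of the simulated example ($b = 100$, $\alpha = .5$) this ratio is $1 - 0.1 = 0.9$ at $x = 1$, but $1 - 10^{-1/2} \approx 0.68$ at $x = 10$ and $1 - 0.5^{1/2} \approx 0.29$ at $x = 50$. On this reading, the straight-line approximation overstates the reverse CDF increasingly towards the upper bound, so a downward bend near $b$ would be expected even for data that follow the bounded power law exactly.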

Both the distribution function (DF) and the reverse CDF are plotted in figure \ref{ImgPowerDistLogLog}. The figure clearly demonstrates that the sampled distribution function is much noisier than the cumulative distribution. Moreover, as long as $x,a\ll b$, $\log CDF_{reverse}(x)$ is approximately linear in $\log x$. Thus if, instead of the density, we plot the reverse cumulative distribution $P(x) = CDF_{reverse}(x)$, we obtain a more stable log-log plot with a slope of $-\alpha$, as the derivation above shows.

\figuremacroWH{ImgPowerDistLogLog}{A power distribution and its reverse CDF on a log-log scale}{A power distribution with exponent $1.5$, $(\alpha=.5)$ and valid values of $x$ ranging between $1$ and $100$}{.5}

In other words, just like the original density distribution, the cumulative distribution function $P(x)$ follows a power law, but with a different exponent which is 1 less than the original exponent. Thus, if we plot $P(x)$ on logarithmic scales we should again get a straight line, but with a shallower slope. However, thanks to the cumulative nature of the data, we would not expect the graph to bounce to zero every time no count has been observed.


\subsection{Applying....}
If \citet{Lotka1926} and \citet{newman2001a,newman2001b} studied the production of books and academic publications, we are now in a position to study distributions in email datasets. Do people produce emails in a distribution that resembles the production of books and articles? Emails, however, have different properties from books: we can study not only the distribution of email production, but also the distributions of consumption and dissemination. Thus, let us define email production as the number of emails sent by any single user, email consumption as the number of emails received by any single user, and email dissemination as the number of recipients per email.

\figuremacroW{ImgRecipMessage}{Dissemination: The reverse cumulative distribution of email recipients}{Approaches a power distribution over two orders of magnitude with $\left( \alpha+1\right)  = 1.86$}{.5}

Note that all three distributions approach the value one close to the minimal value of the random variable, which is the nature of the reverse cumulative distribution. The dissemination distribution seems to approach a power distribution over two orders of magnitude, with an almost precisely linear curve from 1 recipient per email all the way to almost 100 recipients per email ($ \alpha+1  = 1.86$), whereas the consumption and production distributions shown in figure~\ref{ImgProductionConsumptionDistLogLog} are not very close to a power distribution, though they too exhibit a rather fat tail. It may seem surprising at first that consumption and production are differently distributed, with production falling much more steeply at first ($ \alpha+1  \approx 2.01$) than consumption ($ \alpha+1  \approx 1.66$). But recall that the total number of emails consumed is greater than the total number produced, because broadcast emails are produced once but consumed by potentially multiple recipients. This becomes intuitive when one notices that, on average, in-boxes contain more mail than out-boxes, which could not be the case if all emails were single-recipient, private emails.
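
\noindent The accounting behind this observation can be made explicit. In notation introduced here for illustration, let $M$ be the number of emails, $r_e \geq 1$ the number of recipients of email $e$, and let $out(u)$ and $in(u)$ be the number of emails sent and received by user $u$. Assuming that every sender and recipient is counted among the users,
\[
	\sum_u out(u) = M, \qquad \sum_u in(u) = \sum_{e=1}^{M} r_e = \bar{r}M \geq M
\]
where $\bar{r}$ is the mean number of recipients per email. Equality holds only if every email has exactly one recipient.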

\figuremacroW{ImgProductionConsumptionDistLogLog}{Production and Consumption: The reverse cumulative distribution of emails produced and received}{Does not seem to approach a power distribution but is highly skewed over two orders of magnitude with $\left( \alpha+1\right)  = 2.01 \text{ and } 1.66$ for production and consumption respectively.}{.6}


To compare these numbers, \citet{Lotka1926} estimated the distribution of the number of papers published per author to follow a power law with about $ \alpha+1  \approx 2$. \citet{newman2001a} also finds a power law in two databases of academic publication with $ \alpha+1  = 2.86 \text{ and } 3.41$. These results are much steeper than the results for email, but that is probably because the production of emails is so much easier than the production of a scientific publication. 


%-------------------------
%Additionally, we can  now study not only the distribution of email production, we can also study the distribution of senders and recipients. Thus, let us define email production as the total distribution of the number of emails sent by any single user. Email consumption can now be defined as the distribution of the number of emails received by any single user. Email dissemination is now the number of recipients per email. 

%This may seem surprising at first, but figure \ref{ImgProductionConsumptionDistLogLog} makes it clear that the number of messages sent is much smaller than the number of mails received. This is because all emails are sent from a single account, but some can land in more than one inbox, specifically when sent to more than one recipient. This becomes perhaps intuitive when one notices that people's in-boxes tend to be contain more mails than people's out-boxes.





%Both email production and consumption are found to be highly right-skewed distributions spanning four orders of magnitude. The finite time window makes it impossible for these distributions to perfectly fit a power law. However, it is possible to fit the data with a power law with an exponential cut-off as described by \citet{newman2001a}. At least the first two orders of magnitude of the distributions nicely fit power laws with exponents $-2.01$ for email production and $-1.66$ for email consumption. This result may be seen as a generalization of Lotka's law for email use. Moreover, dissemination was measured as the distribution of the number of recipients per email (see figure \ref{ImgRecipMessage}). Fitting a power law to the first two orders of magnitude of this distribution yields an estimate of the exponent to be  $1.86$.



This result is interesting for several reasons. In terms of human production of intellectual or symbolic resources, Lotka's law has been tested numerous times in the past for the production of co-authored papers \citep{newman2001b} and the production of open source software \citep*{Newby2003}. However, to the best of my knowledge, this is the first reported attempt to extend Lotka's law to patterns of production and consumption of email communication. 

Moreover, email production and consumption seem to be compatible with Lotka's law regarding the production of scientific journal articles. This means that many email users produce and consume a relatively modest number of emails, but also that there is a significant number of users who act as hubs of email production and consumption. The comparison between the production of emails and the production of scientific articles suggests that methods and theories developed for the analysis of networks of scientific collaboration \citep{newman2001b, newman2001a} could be tested on networks of email communication. Note, however, that there is a fundamental difference between production (of emails or papers) and the consumption of emails. In the first case, the decision to produce rests with the author being observed; email consumption, by contrast, results from choices made by others, namely the senders of the emails. This may have theoretical implications when generalizing Lotka's law to email consumption.

Finally, the distribution of the number of recipients per email shown in figure \ref{ImgRecipMessage} suggests that there exists a preferential attachment of recipients to emails. The sizes of groups of email recipients seem to follow the power law distributions found in the sizes of cities, organizations and other social groups or entities \citep*{Adamic2000, Adamic2002, newman2005}. The extent to which an email's list of recipients also delineates meaningful organizational or functional units is further explored in the next section.

\section{Calculating the Strength of the Tie Using the Number of Recipients}
In the previous section the email dataset was described along dimensions relating to email usage: production, consumption and dissemination. This analysis, it was argued, could motivate further exploration of the relations between email messages and the groups they circumscribe.



%TODO mention the backbone article

% How it is unique in emails
% Future work and expansion

%Special care should be taken when constructing network models if one aims to capture more value from communication datasets. This is because the decisions modellers make (e.g., thresholds, noise filtering techniques) may have an impact on the properties of the resulting network. For example, discarding emails with multiple recipients could perhaps yield networks with underestimated levels of clustering. Moreover, extracting dyadic relationships from emails conceals important information about the dependencies between these ties. One way to overcome this is to translate the number of recipients into the weight of the ties in the network as done in section 3.1. A related observation is likewise drawn from the literature of scientific collaboration networks. These are sometimes studied as two-mode networks, where authors are one mode and scientific articles are the second mode. By the same token, users and emails can be viewed as two modes in a directed two-mode network, albeit a rather unusual one: while most two-mode networks are not directed, in this case directed ties connect individuals to emails (when sending) and emails to recipients (when receiving). Still, by using a two-mode approach more information from the original dataset could be represented in the network models, and theories and methods developed for two mode networks can be applied to further investigate networks of communication.




Does the number of recipients of an email indicate the kind of relationship that exists between the sender and each recipient? One conjecture is that two users are more strongly tied if they exchange `private' emails, whereas more weakly tied users exchange only multi-recipient messages. A possible explanation is that sending an email to fewer recipients places a stronger obligation on each recipient to reply. Thus networks formed from private emails may have a higher level of reciprocity, perhaps consisting of stronger ties.

Emails with fewer recipients may lead to higher levels of reciprocity, but do they also signify stronger ties? A claim along these lines has been made in the context of networks of scientific collaboration \citep*{newman2001a, Borner2005}. For example, \citet*{newman2001a} claims that `it is probably the case [\ldots] that two scientists whose names appear on a paper together with many other co-authors know one another less well on average than two who were the sole authors of a paper'. To account for this effect, Newman let each co-authored paper contribute a certain weight to the valued tie connecting each author to each of the other co-authors. This weight is inversely proportional to the number of those co-authors, so that a scientist who collaborates with $n-1$ co-authors on a paper is, on average, acquainted with each of them $\frac{1}{n-1}$ times as well as if he or she were collaborating with just one co-author.
This idea can readily be adapted to email communication datasets. Since all the ties extracted from email $k$ relate the sender $i$ to recipients $j = 1, 2, \ldots, n_k$, we would expect the weight of the directed tie between sender $i$ and receiver $j$ to be given by equation \ref{FrmWeightedRecipients}.

\begin{equation}\label{FrmWeightedRecipients}
w_{ij} = \sum_k\frac{\delta_{ij}^k}{n_k}
\end{equation}
%\begin{equation}\label{prior}
%\lambda_{i} \sim \textrm{Gamma}\left(aX_{i}^{(t-1)},a\right) ,
%\end{equation}

\noindent where the summation is over all emails $k$, $n_k$ is the number of recipients of email $k$, and $\delta_{ij}^k$ contributes to the sum only if user $i$ sent email $k$ to recipient $j$. It is defined as follows:

\[
\delta_{ij}^k = \left\{
\begin{array}{ll}
		1 &  \quad \mbox{if user $i$ sent email $k$ to recipient $j$} \\
		0 &  \quad \mbox{otherwise} \end{array} \right.
\]
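
As a hypothetical illustration of equation \ref{FrmWeightedRecipients} (the numbers below are invented for the example and are not taken from the dataset), suppose that user $i$ sent three emails: email $1$ to $j$ alone ($n_1 = 1$), email $2$ to $j$ and three other recipients ($n_2 = 4$), and email $3$ to ten recipients, none of whom is $j$ ($n_3 = 10$). Then $\delta_{ij}^1 = \delta_{ij}^2 = 1$ and $\delta_{ij}^3 = 0$, so that

\[
w_{ij} = \frac{1}{1} + \frac{1}{4} + \frac{0}{10} = 1.25 .
\]

\noindent In this example the private email contributes four times as much to the tie as the email shared with three other recipients, and the third email contributes nothing to this particular tie.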

\begin{comment}

% ctrl+ Q
\[f(n) = \left\{
\begin{array}{l l} n/2 & \quad \mbox{if $n$ is even}\\
  -(n+1)/2 & \quad \mbox{if $n$ is odd}\\ \end{array} \right. \]


library(igraph)
g<-barabasi.game(10000, power=5)
# generate a power distribution 1/x^(alpha+1), between the value min and max
generatePower<- function(nmax, alpha, min=1, max=Inf) {
	u<-runif(nmax) # alpha<-1; min=1; max=100; u<-.001
	norm<-(min^-alpha - max^-alpha)
	(min^-alpha - u*norm)^-(1/alpha)
}

hist<-hist(generatePower(100000,1, max=10), breaks=100, plot = FALSE)
plot( hist$mids, hist$intensities + 1e-6, log="xy", col=1, pch=19, cex=.3, ylab='frequency', xlab='x value')
plot( hist$mids, cumsum(hist$intensities), log="xy", col=1, pch=19, cex=.3, ylab='frequency', xlab='x value')
plot( hist$mids, rev(cumsum(rev(hist$intensities))), log="xy", col=1, pch=19, cex=.3, ylab='frequency', xlab='x value')

\end{comment}


\noindent One important distinction between email networks and networks of scientific collaboration is that the former are directed whereas the latter are not. Directionality makes it possible to test reciprocity. Since `reciprocal services' are related to the `strength of ties' \citep{Friedkin1980, granovetter1973}, equation \ref{FrmWeightedRecipients} can be put to the test by grouping ties of similar weight and examining whether a higher average tie weight is associated with a higher proportion of reciprocated ties.

To achieve this, the original $68{,}409$ directed arcs were ordered by increasing weight and divided into $50$ bins, each bin consisting of nearly the same number of arcs (about $1368$ arcs in each bin) of similarly ranked strength. The proportion of reciprocated ties was calculated within each bin and was used as the response variable in a simple logistic regression model, where the explanatory variable was the weight ranking of the ties.
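
As a quick check on the bin sizes, and assuming that the arcs are split as evenly as possible (the exact rule for distributing the remainder is an assumption made here for illustration), note that

\[
68{,}409 = 50 \times 1368 + 9 ,
\]

\noindent so that nine bins would hold $1369$ arcs and the remaining $41$ bins would hold $1368$ arcs each. For a hypothetical bin of $1368$ arcs of which $342$ are reciprocated, the proportion entering the regression would be $342 / 1368 = 0.25$.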


\begin{eqnarray}
 y_{i} &\sim& \hbox{Binomial }(n_i,\pi_{i}) \nonumber \\
 \pi_{i} &=&  \mathbb{E}\left[\frac{y_i}{n_i} \vert x_i \right]  \nonumber \\
 \logit \left(\mathbb{E}\left[\frac{y_i}{n_i} \vert x_i \right] \right) &=& \beta x_i \nonumber
\end{eqnarray}


\noindent where $y_i$ is the number of reciprocated ties in bin $i$, $n_i$ is the number of ties in that bin (so that $\frac{y_i}{n_i}$ is the proportion of reciprocated ties), $x_i$ is the rank of the tie strength (the bin number), and $\beta$ is the coefficient estimated by the regression. The fitted model was significant at the $0.001$ level, with an increase of one rank in tie strength explaining an increase of $13\%$ in the log odds for tie reciprocation (see figure \ref{ImgRecipStrength}). Similar relationships were found when the analysis was restricted to emails with fewer than $20$, or even fewer than $15$, recipients.

\figuremacroWH{ImgRecipStrength}{Ratio of reciprocity regressed on the strength of the tie}{Reciprocity is explained by the ranking of tie strength in comparable sized sub-networks}{.6}

Note that if an arc is reciprocated by another arc of very different strength, the two would fall into different bins and would both count as asymmetric ties. Thus we are not testing reciprocity per se but mutuality \citep[p 40]{monge2003}, i.e., the extent to which a directed tie is reciprocated by a tie with the same rank of strength. Mutuality is of course a stronger version of reciprocity. Note also that the weights calculated in equation \ref{FrmWeightedRecipients} increase with the number of emails sent between two actors and decrease with the average number of recipients per email. To rule out the possibility that reciprocity was explained mainly by the relative rate of emails exchanged between two actors, it was important to test the separate contribution of the number of recipients to the effect on reciprocity.

To this end, the email communication dataset was reshuffled and then compared to its original version. The reshuffling proceeded in the following manner: all outgoing emails sent by each user were identified and grouped together. Within each group, recipients of distinct emails were swapped at random. As a result, users who received mostly private messages in the original network could now be found in emails with sizable recipient lists (and vice versa). However, the reshuffling process did not change most of the dataset's global properties: the number of ties, the number of users and the number of emails all stayed the same, as did the distributions of production, consumption and dissemination of emails. From the point of view of the traditional approach to constructing networks from email communication data, nothing changed in the network model.


Nevertheless, the reshuffling process had a substantial effect on the weights of ties defined in equation \ref{FrmWeightedRecipients}. When comparing the correlation between tie weight and reciprocity, both networks exhibit significant correlations, because the relative frequency of emails sent between two nodes is significantly correlated with reciprocity. However, the variation of network reciprocity explained by tie-weights decreases substantially in the reshuffled network, indicating that the weights calculated from the original network better explain patterns of reciprocity.
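
The effect of reshuffling on the weights can be seen in a minimal hypothetical example (the users and emails below are invented for illustration and are not drawn from the dataset). Suppose that user $i$ sent two emails: email $1$ to recipient $a$ alone, and email $2$ to recipients $b$, $c$ and $d$. Swapping $a$ and $b$ between the two emails gives

\[
\begin{array}{lll}
 & \mbox{original} & \mbox{reshuffled} \\
\mbox{email 1} & \{a\} & \{b\} \\
\mbox{email 2} & \{b, c, d\} & \{a, c, d\}
\end{array}
\]

\noindent Each email keeps its number of recipients ($n_1 = 1$, $n_2 = 3$), each user still receives exactly one email from $i$, and the same four arcs leave $i$. Under equation \ref{FrmWeightedRecipients}, however, $w_{ia}$ drops from $1$ to $\frac{1}{3}$ while $w_{ib}$ rises from $\frac{1}{3}$ to $1$.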


To further test the utility of equation \ref{FrmWeightedRecipients}, another test was carried out, this time of the strength of weak ties hypothesis \citep{granovetter1973}. According to this hypothesis, strong ties are embedded in a tightly knit environment. If this mechanism is at work in the dataset, we would expect two individuals connected by a stronger tie to have more mutual contacts than individuals connected by a weaker tie. The ratio of mutual contacts $M_{ij}$ in the neighborhood of two connected individuals $i \mbox{ and } j$ can be quantified using an equation suggested by \citet{onnela2007}, who used it in a similar context, namely the confirmation of the strength of weak ties hypothesis in a study of mobile phone transactions: % \ref{FrmOverlap}, 


\begin{equation}\label{FrmOverlap}
M_{ij} = \frac{n_{ij}}{\left( k_i-1 \right) + \left( k_j - 1 \right) - n_{ij} }
\end{equation}


\noindent where $n_{ij}$ is the number of common neighbours of $i \text{ and } j$, and $k_i \left(k_j \right) $ denotes the degree of node $i \left(j \right) $. If the two nodes share no common neighbours, then $n_{ij}=0 \text{ and therefore } M_{ij} = 0$. Otherwise, since their degrees are $k_i \text{ and } k_j $ respectively and since they are connected to each other, they have $k_i-1 \text{ and } k_j-1 $ spare degrees for other contacts. If all their contacts are mutual we have $n_{ij} = k_i-1 = k_j-1$ and $M_{ij}=1$. An attempt to verify the strength of weak ties hypothesis by correlating the strength of the tie\footnote{The strength of the tie was calculated as the average of the strength of both arcs  $\frac{w_{ij} + w_{ji}}{2}$ where $w_{ij}$ was calculated according to equation \ref{FrmWeightedRecipients}.} with the overlap $M_{ij}$ failed to show significance. This means that the null hypothesis, that tie strength and the proportion of mutual contacts are uncorrelated, could not be rejected. The strength of weak ties hypothesis could therefore not be confirmed in this particular dataset.
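
To illustrate equation \ref{FrmOverlap} with hypothetical values (not drawn from the dataset), take two connected nodes with degrees $k_i = 4$ and $k_j = 5$ that share $n_{ij} = 2$ common neighbours:

\[
M_{ij} = \frac{2}{\left( 4-1 \right) + \left( 5-1 \right) - 2} = \frac{2}{5} = 0.4 .
\]

\noindent That is, of the five distinct other contacts that $i$ and $j$ have between them, two are shared.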




%The evidence presented in this section reveals that the size of the list of recipients in emails is inversely related to the likelihood of reciprocated ties. This relation could be incorporated into the network model by calculating the weight of ties using equation \ref{FrmWeightedRecipients}. The result is a network model that better captures the original dataset. Let us now turn to another property associated with tie strength, namely the clustering coefficient.


\section{Summary and Reflections}
On one level, the findings presented in this chapter are self-evident to anyone who uses email. People feel more compelled to reply to a private, personal email than to a broadcast email, simply because a personal email is often an invitation to reply. I think that few would find it surprising that incoming broadcast emails are less likely to elicit a reply than private emails.

Moreover, email client software enables users to \textit{reply all} to an incoming message. It is enough for a few of the recipients of a broadcast email to hit \textit{reply all} for the transitivity of the group to increase.

There is nothing too surprising in the finding that recipient number is a confounding variable that correlates with increasing transitivity and decreasing reciprocity. What may be surprising is that, taken together, this pattern effectively violates the strength of weak ties hypothesis. Here is a summary of the empirical findings:
\begin{enumerate}
\item \textit{Degree Distribution.} As is commonly the case in social networks, all degree distributions are positively skewed. However, the degree distributions of broadcast emails are much more skewed than those of private ones, and the out-degree distributions of broadcast emails are much more skewed than the in-degree distributions, an effect that increases as the number of recipients grows.
\item \textit{Reciprocity.} Private emails contribute to reciprocity in the aggregate network, much more than broadcast emails. 
\item \textit{Transitivity.} Broadcast emails contribute to transitivity in the aggregate network, much more than private ones. 
\end{enumerate}

Since the traditional methods of constructing social networks from email datasets conceal part of the information in the original dataset, this chapter explores a method to incorporate some of the information through the strength of the tie. The utility of the method is verified against a theoretical proposition: first, the strength of the tie is found to be significantly correlated with  reciprocation, which is what we would expect.  

% ----------------------- end of thesis sub-document ------------------------
% --------------------------------------------------------------------------- 1